I chose to subset this data and only look at the most recent year (2011) due to the enormous size of the original dataset with over a million records. This new raw dataset contains just over 300,000 records.
Each record represents one collision report per person involved in a collision in Canada. So if there were 4 people involved in a collision there would be 4 separate records.
## [1] 301724
## [1] 23
## [1] "X" "C_YEAR" "C_MNTH" "C_WDAY" "C_HOUR" "C_SEV" "C_VEHS"
## [8] "C_CONF" "C_RCFG" "C_WTHR" "C_RSUR" "C_RALN" "C_TRAF" "V_ID"
## [15] "V_TYPE" "V_YEAR" "P_ID" "P_SEX" "P_AGE" "P_PSN" "P_ISEV"
## [22] "P_SAFE" "P_USER"
## 'data.frame': 301724 obs. of 23 variables:
## $ X : int 4598867 4598868 4598869 4598870 4598871 4598872 4598873 4598874 4598875 4598876 ...
## $ C_YEAR: int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ C_MNTH: Factor w/ 13 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ C_WDAY: Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ C_HOUR: Factor w/ 25 levels "00","01","02",..: 11 13 1 18 18 23 17 17 14 14 ...
## $ C_SEV : int 2 2 2 2 2 2 2 2 2 2 ...
## $ C_VEHS: Factor w/ 21 levels "01","02","03",..: 1 1 1 2 2 1 2 2 2 2 ...
## $ C_CONF: Factor w/ 20 levels "01","02","03",..: 2 4 3 7 7 4 16 16 16 16 ...
## $ C_RCFG: Factor w/ 12 levels "01","02","03",..: 3 12 12 12 12 5 2 2 2 2 ...
## $ C_WTHR: Factor w/ 9 levels "1","2","3","4",..: 1 1 7 1 1 1 4 4 1 1 ...
## $ C_RSUR: Factor w/ 11 levels "1","2","3","4",..: 3 5 3 1 1 3 3 3 1 1 ...
## $ C_RALN: Factor w/ 8 levels "1","2","3","4",..: 2 1 1 1 1 1 4 4 1 1 ...
## $ C_TRAF: Factor w/ 19 levels "01","02","03",..: 17 19 19 17 17 17 15 15 17 17 ...
## $ V_ID : Factor w/ 60 levels "01","02","03",..: 1 1 1 1 2 1 1 2 1 2 ...
## $ V_TYPE: Factor w/ 20 levels "01","05","06",..: 1 1 1 1 1 1 1 1 1 14 ...
## $ V_YEAR: Factor w/ 81 levels "1910","1911",..: 81 81 81 81 81 81 81 81 81 81 ...
## $ P_ID : Factor w/ 65 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ P_SEX : Factor w/ 4 levels "F","M","N","U": 2 1 1 1 2 1 2 1 1 2 ...
## $ P_AGE : Factor w/ 101 levels "01","02","03",..: 75 21 34 50 63 26 34 20 80 50 ...
## $ P_PSN : Factor w/ 16 levels "11","12","13",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ P_ISEV: Factor w/ 5 levels "1","2","3","N",..: 2 2 2 2 1 2 2 2 2 1 ...
## $ P_SAFE: Factor w/ 10 levels "01","02","09",..: 8 2 2 2 8 2 2 2 2 8 ...
## $ P_USER: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 6 ...
## X C_YEAR C_MNTH C_WDAY
## Min. :4598867 Min. :2011 07 : 29320 5 :49913
## 1st Qu.:4674298 1st Qu.:2011 08 : 29203 4 :46053
## Median :4749728 Median :2011 01 : 28799 3 :44428
## Mean :4749728 Mean :2011 06 : 28194 6 :44106
## 3rd Qu.:4825159 3rd Qu.:2011 09 : 26521 2 :42602
## Max. :4900590 Max. :2011 05 : 24760 1 :39503
## (Other):134927 (Other):35119
## C_HOUR C_SEV C_VEHS C_CONF
## 16 : 26576 Min. :1.000 02 :184411 21 :89896
## 15 : 25671 1st Qu.:2.000 01 : 66643 35 :39812
## 17 : 25541 Median :2.000 03 : 36476 06 :25445
## 14 : 20397 Mean :1.983 04 : 9852 36 :23836
## 12 : 19618 3rd Qu.:2.000 05 : 2409 33 :21836
## 18 : 19320 Max. :2.000 06 : 813 QQ :18475
## (Other):164601 (Other): 1120 (Other):82424
## C_RCFG C_WTHR C_RSUR C_RALN
## 02 :140705 1 :208146 1 :196596 1 :219870
## 01 :110967 2 : 33011 2 : 55230 2 : 25567
## UU : 22287 3 : 29662 5 : 16788 U : 19823
## 03 : 13133 4 : 19736 3 : 13319 3 : 18336
## QQ : 8985 6 : 4313 Q : 10257 4 : 10160
## 05 : 2746 U : 4079 4 : 4846 5 : 4163
## (Other): 2901 (Other): 2777 (Other): 4688 (Other): 3805
## C_TRAF V_ID V_TYPE V_YEAR
## 18 :156071 01 :160208 01 :245756 2007 : 21771
## 01 : 84044 02 :109102 NN : 13503 2005 : 20627
## 03 : 32008 03 : 15048 14 : 7464 2008 : 19684
## UU : 13609 99 : 12819 17 : 6469 2003 : 19357
## 04 : 4687 04 : 3161 06 : 5806 2006 : 19314
## QQ : 4052 05 : 706 07 : 5275 2010 : 18279
## (Other): 7253 (Other): 680 (Other): 17451 (Other):182692
## P_ID P_SEX P_AGE P_PSN P_ISEV
## 01 :216999 F:126255 UU : 23368 11 :200139 1:119265
## 02 : 57349 M:157804 19 : 8027 13 : 41526 2:155761
## 03 : 16713 N: 839 18 : 7870 23 : 12746 3: 2006
## 04 : 6289 U: 16826 20 : 7668 99 : 11664 N: 17489
## 05 : 1989 21 : 7379 21 : 10664 U: 7203
## NN : 617 17 : 6865 UU : 6990
## (Other): 1768 (Other):240547 (Other): 17995
## P_SAFE P_USER
## 02 :211675 1:185197
## UU : 31933 2: 74796
## NN : 30159 3: 12819
## 01 : 8907 4: 6469
## 13 : 7626 5: 7464
## 09 : 7535 U: 14979
## (Other): 3889
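The code behind the inspection output above is not shown. It is the kind of output base R's `nrow()`, `ncol()`, `names()`, `str()`, and `summary()` produce. Here is a minimal sketch on a tiny made-up data frame; `ncdb` and all of its values are hypothetical stand-ins, not the real file:

```r
# Hypothetical stand-in for the full collision file (values invented for illustration)
ncdb <- data.frame(
  C_YEAR = c(2010, 2011, 2011, 2011),
  C_SEV  = c(2, 2, 1, 2),
  P_SEX  = factor(c("M", "F", "M", "U")),
  P_AGE  = factor(c("34", "19", "UU", "52"))
)

# Keep only the most recent year
df_2011 <- subset(ncdb, C_YEAR == 2011)

nrow(df_2011)     # number of records (one per person involved)
ncol(df_2011)     # number of variables
names(df_2011)    # column names
str(df_2011)      # type and first values of each column
summary(df_2011)  # numeric columns get quantiles, factors get level counts
```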
There are three kinds of variables in this dataset: collision-level, vehicle-level, and person-level.
From an initial look at the summary output, Friday (5) appears to be the weekday with the most collisions reported. The 4-5 PM hour (16) is the most dangerous time for motorists. Most collisions involve 2 vehicles (C_VEHS). Rear-end collisions (21) are the most common type of accident (C_CONF), and the most common location is an intersection (02) (C_RCFG). Nothing out of the ordinary so far. However, perhaps surprisingly, most accidents actually occur on clear and sunny days by a wide margin (1) (C_WTHR) and on dry, normal (1) (C_RSUR), straight and level (1) (C_RALN) road surfaces. Finally, most collisions occur where no traffic controls (such as stop signs, police officers, reduced speed zones, etc.) are present (18) (C_TRAF).
Most collisions involved light duty vehicles such as passenger cars, vans, and light duty pickup trucks (01) (V_TYPE). Most collisions involve more recent model vehicles, but there isn’t a large variance between years (V_YEAR).
Men are involved in more collisions than women. Many records have no reported age (UU = Unknown) (P_AGE), but among those that do, the most common ages are those of teenagers and young adults (17-21 years old). About two thirds of the records are for drivers (11) rather than passengers (P_PSN). Only 0.66% of records (one per person involved) show a fatality (3), while the rest are split mostly between injury and no injury (P_ISEV). In the vast majority of incidents, a safety device such as a seat belt or helmet (for motorcycle, bicycle, snowmobile, and ATV riders) was worn (02) (P_SAFE). There are different categories of people in these collision reports, but the vast majority are vehicle drivers (1) and passengers (2), with pedestrians (3) a distant third (P_USER).
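The percentages quoted here can be checked directly from the counts in the summary output. A short worked calculation (the variable names are just for illustration):

```r
# Counts copied from the summary output above
n_records <- 301724
n_fatal   <- 2006    # P_ISEV == "3"
n_drivers <- 200139  # P_PSN == "11"

pct_fatal   <- round(100 * n_fatal / n_records, 2)
pct_drivers <- round(100 * n_drivers / n_records, 1)

pct_fatal    # 0.66
pct_drivers  # 66.3
```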
Looks like there is something unusual with this plot towards the right tail.
## Var1 Freq
## 100 NN 1029
## 101 UU 23368
## UU 19 18 20 21 17 22 23 25 24
## 23368 8027 7870 7668 7379 6865 6661 6315 6062 5999
It looks like many records either have no reported age (UU) or have an age that is not applicable (NN), e.g. a “dummy” person record created for a parked car. We’ll go ahead and remove these records from the dataset and also clean up the gender variable to keep only males and females.
We are now looking at 275270 records instead of just over three hundred thousand from the original data. We got rid of under 10% of the records and it shouldn’t affect our ultimate analysis.
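The cleaning step itself is not shown. One way it could look in base R is sketched below on a made-up data frame (`people` and its values are hypothetical; only the `UU`/`NN` and `U`/`N` codes come from the output above):

```r
# Hypothetical toy records; the unknown / not-applicable codes mirror those seen above
people <- data.frame(
  P_AGE = factor(c("19", "UU", "45", "NN", "30")),
  P_SEX = factor(c("M", "F", "F", "N", "U"))
)

# Keep rows with a known age and a sex of F or M
keep  <- !(people$P_AGE %in% c("UU", "NN")) & people$P_SEX %in% c("F", "M")
clean <- droplevels(people[keep, ])  # droplevels() removes the now-unused factor levels

nrow(clean)          # rows that survive the filter
levels(clean$P_SEX)  # only "F" and "M" remain

# Share of the real dataset removed, using the record counts reported in the text
round(100 * (301724 - 275270) / 301724, 1)  # about 8.8
```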
This chart now shows the age distribution of ALL people involved in collisions. It is bimodal, with one peak around the late teens and early twenties and another that looks like people in their 40s. Let’s confirm that.
From this chart (age > 15), we can see more precisely that 45-50 is where the second peak is. But, as per the original goal of this analysis, we are really interested in the age of Drivers only.
We see a similar distribution when considering Drivers only. The left tail of the distribution is now flat, as we don’t expect many children to be driving (at least in Canada!).
Are more men or women behind the wheel before an accident?
Contrary to this reddit joke (in bad taste I may add) men are actually involved in more collisions than women overall.
Is this difference between the sexes still present when we look at deadly crashes only?
Yes, and the contrast is much more pronounced. There are relatively few deaths per year (thank god!), but in 3 out of every 4 collisions that do involve a death, a man was driving the vehicle.
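A share like "3 out of 4" is a column-wise proportion of a two-way table. A minimal sketch with invented records (the `drivers` data frame and its `fatal` flag are hypothetical, chosen only so the arithmetic is easy to follow):

```r
# Hypothetical driver records (values invented; fatal = 1 marks a deadly collision)
drivers <- data.frame(
  P_SEX = c("M", "M", "F", "M", "F", "M", "F", "M"),
  fatal = c(1,   0,   0,   1,   1,   1,   0,   0)
)

counts <- table(sex = drivers$P_SEX, fatal = drivers$fatal)

# margin = 2 normalises within each column, i.e. within fatal and within non-fatal
shares <- prop.table(counts, margin = 2)
shares["M", "1"]  # share of deadly-collision drivers who are men (0.75 in this toy data)
```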
Let’s shift gears (no pun intended) and look at collisions in regards to time.
Looks like Friday (5) has the most accidents (an increasing trend from Monday, perhaps because people are in a rush to celebrate the weekend!) and Sunday has the fewest accidents (this could be because very few people work on Sundays, so there may be fewer vehicles commuting overall).
Looks like deadly collisions happen the most on Saturdays (6) as opposed to Fridays (5) for non-deadly crashes.
Not surprisingly, most collisions occur during 4:00-5:00 PM, when many people are heading home after work.
For deadly collisions, a larger proportion occur during the sleeping hours (from midnight to before the start of the workday).
Looks like January (01) and the summer months (07 & 08) have the most reported collisions (it could be that more people are driving in the summer for the holidays).
Not really, the absolute counts are less, but the distribution is quite similar.
Speaking of drivers, we haven’t yet discussed how much of this dataset consists of collision reports from drivers as opposed to passengers, pedestrians, etc.
The vast majority of this dataset consists of reports from drivers (1) and passengers (2).
Perhaps counterintuitively, most accidents occur on sunny days (1). At first, you might have thought that more accidents occur in bad weather.
A larger proportion of the accidents that motorcyclists are involved in occur on sunny days (1) compared with motor vehicle drivers (this could simply be because many motorcyclists ride only in the spring and summer months, when weather conditions are better).
Most collisions involve, not surprisingly, 2 vehicles.
## [1] 2
Let’s clean up the plot a bit and only look at vehicles manufactured after 1980.
## [1] 2
Doesn’t seem to be any clear trend other than most vehicles involved in accidents are from most recent models (it could simply be there are more newer vehicles on the road). Looks like a dead end here (pun intended!).
Okay, so now the type of vehicle you are driving has to be a factor in determining whether deaths are involved in collisions right? Let’s see.
Looks like the vast majority of vehicles involved in accidents are Light Duty Vehicles (1).
## [1] 2
Remember that we are now looking at drivers only; we see that a larger proportion of Trucks (07) & Road Tractors (08) are involved in deadly crashes. This may be one variable to help predict deadly crashes.
The main features of interest in the data set are C_SEV (the severity of the collision: in this data it takes only the values 1, fatal, and 2, non-fatal) and P_AGE (the age of the person involved in the collision). I suspected that the driver’s age, combined with some other variables, could help predict deadly collisions.
In addition to the age of the driver, I originally thought that vehicle age (V_YEAR), vehicle type (V_TYPE), and gender (P_SEX) would be larger factors in determining fatality in collisions. The plots support this initial hunch except for vehicle age, which doesn’t seem to vary between deadly and non-deadly collisions. I will look further at age in the bivariate analysis to see if there is a difference between deadly and non-deadly collisions.
Yes, I did create new variables just so that the data could be subset since the original variables were factors (categorical) and could not be subsetted otherwise (P_AGE_NUM & V_YEAR_NUM). I didn’t want to directly convert the original columns in case I needed the factor variables at a later time.
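The conversion code for P_AGE_NUM and V_YEAR_NUM is not shown, so the following is only one common way to do it. Calling `as.numeric()` directly on a factor returns its internal level codes. To recover the actual values, convert to character first. The toy factor below is hypothetical:

```r
# Hypothetical factor of ages, including the "UU" (unknown) code
P_AGE <- factor(c("19", "45", "UU", "07"))

as.numeric(P_AGE)  # level codes (positions in levels(P_AGE)), not the ages

# Go through character first; non-numeric codes such as "UU" become NA
P_AGE_NUM <- suppressWarnings(as.numeric(as.character(P_AGE)))
P_AGE_NUM          # 19 45 NA 7
```

Keeping the original factor column and writing the result to a new column, as described above, leaves both representations available.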
Yes, I did adjust the form of the data to show only the most recent year of data as there were millions of total records in the initial dataset. As mentioned earlier, I subset 300,000 records and then I removed data that had unknown or missing gender values as I suspected that the sex of the driver in fatal collisions could be skewed towards one sex (as confirmed by a histogram plot).
Before we get to bivariate plots, let’s do some additional data wrangling.
## X C_SEV P_AGE_NUM V_YEAR_NUM
## X 1.00 -0.02 0.01 NA
## C_SEV -0.02 1.00 -0.02 NA
## P_AGE_NUM 0.01 -0.02 1.00 NA
## V_YEAR_NUM NA NA NA 1
This initial correlation table with the current numeric variables doesn’t show us much valuable info. Let’s see what other variables can be considered numeric so we can create a more useful correlation table.
# Note: as.numeric() on a factor returns its integer level codes, so non-numeric
# codes such as "UU" get a code of their own (not NA) and C_HOUR "00" becomes 1
df$C_MNTH_NUM <- as.numeric(df$C_MNTH)
df$C_WDAY_NUM <- as.numeric(df$C_WDAY)
df$C_HOUR_NUM <- as.numeric(df$C_HOUR)
df$C_VEHS_NUM <- as.numeric(df$C_VEHS) #number of vehicles
df$C_WTHR_NUM <- as.numeric(df$C_WTHR) #higher value = worse weather
df$C_RSUR_NUM <- as.numeric(df$C_RSUR) #higher value = worse road conditions
#Create a new variable, M (Mortality), 1 = Died, 0 = Survived
df$M <- ifelse(df$C_SEV == 1, 1, 0)
#However, some 0s are unknown values that need to be removed
We have created 6 new numeric variables from the original factor variables and a new binary variable called M (Mortality). 1 means at least one person died immediately after the collision or went to the hospital and died as a result of his/her injuries. 0 means everyone involved survived the crash. However, M is derived from the C_SEV variable we looked at earlier, which may contain unknown values that still need to be removed.
There happen to be no such unknown records for 2011. We will also further subset the data to only look at drivers since that’s what our original question asked: what features (related to drivers, their vehicles, and the collision itself) are common in accidents that result in fatalities?
Now we have reduced the records by just under 100,000 (175925 records) and we can look at the correlation matrix now.
## X C_YEAR C_SEV P_AGE_NUM V_YEAR_NUM C_MNTH_NUM C_WDAY_NUM
## X 1.00 NA -0.02 0.01 NA 1.00 0.08
## C_YEAR NA 1 NA NA NA NA NA
## C_SEV -0.02 NA 1.00 -0.02 NA -0.02 -0.02
## P_AGE_NUM 0.01 NA -0.02 1.00 NA 0.02 -0.04
## V_YEAR_NUM NA NA NA NA 1 NA NA
## C_MNTH_NUM 1.00 NA -0.02 0.02 NA 1.00 0.00
## C_WDAY_NUM 0.08 NA -0.02 -0.04 NA 0.00 1.00
## C_HOUR_NUM 0.03 NA 0.01 -0.01 NA 0.03 0.00
## C_VEHS_NUM -0.04 NA -0.02 0.02 NA -0.04 -0.03
## C_WTHR_NUM -0.08 NA -0.01 -0.03 NA -0.08 0.01
## C_RSUR_NUM -0.17 NA -0.01 -0.05 NA -0.17 0.01
## M 0.02 NA -1.00 0.02 NA 0.02 0.02
## C_HOUR_NUM C_VEHS_NUM C_WTHR_NUM C_RSUR_NUM M
## X 0.03 -0.04 -0.08 -0.17 0.02
## C_YEAR NA NA NA NA NA
## C_SEV 0.01 -0.02 -0.01 -0.01 -1.00
## P_AGE_NUM -0.01 0.02 -0.03 -0.05 0.02
## V_YEAR_NUM NA NA NA NA NA
## C_MNTH_NUM 0.03 -0.04 -0.08 -0.17 0.02
## C_WDAY_NUM 0.00 -0.03 0.01 0.01 0.02
## C_HOUR_NUM 1.00 0.01 -0.03 -0.04 -0.01
## C_VEHS_NUM 0.01 1.00 0.02 -0.02 0.02
## C_WTHR_NUM -0.03 0.02 1.00 0.43 0.01
## C_RSUR_NUM -0.04 -0.02 0.43 1.00 0.01
## M -0.01 0.02 0.01 0.01 1.00
There don’t seem to be any linear relationships here, except for a moderate correlation of 0.43 between Weather & Road Surface (C_WTHR_NUM & C_RSUR_NUM).
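The NA entries in the matrix are what `cor()` returns for a column with zero variance (C_YEAR is always 2011) or, under the default `use = "everything"`, for a column containing missing values. The latter is presumably the case for V_YEAR_NUM, though that is an assumption. A small sketch on invented numbers shows both effects and one way around them:

```r
# Hypothetical numeric columns: one constant, one with a missing value
toy <- data.frame(
  C_YEAR     = c(2011, 2011, 2011, 2011, 2011),
  P_AGE_NUM  = c(19, 45, 30, 62, 24),
  V_YEAR_NUM = c(2007, NA, 2003, 2010, 1998),
  M          = c(0, 0, 1, 0, 1)
)

# Default behaviour: the constant column and the column with NA yield NA correlations
suppressWarnings(round(cor(toy), 2))

# Drop the constant column and use pairwise-complete observations instead
vars <- toy[, c("P_AGE_NUM", "V_YEAR_NUM", "M")]
cm   <- round(cor(vars, use = "pairwise.complete.obs"), 2)
cm
```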
No clear relationship here. Looks like there are quite a few accidents when the weather is really bad and when the road surface is at its worst (top right hand corner).
Let’s see if we can get any inspiration to compare other bivariate relationships with a pairwise plot of some selected variables.
Let’s see what other relationships we can explore further.
Looks like newer cars are involved in most accidents whether deadly (1) or not (0).
Looks like many deadly crashes had no safety device such as a seat belt used (light pink) (this is consistent across the major types of vehicles).
Let’s look at the age distribution of drivers of different types of vehicles.
Nothing out of the ordinary, we can see that school bus drivers (9) and RV drivers (18) skew towards older drivers.
Let’s now explore what traffic controls appear to be more effective than others.
From the top plot it appears as though no controls (18) being present would be the predominant traffic control when deadly accidents occur. But looking at the proportions, it looks as though the largest proportion of deaths actually occur where warning signs (5) are present.
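The difference between the two views is the difference between raw counts and row-wise proportions. A minimal sketch with invented records (the `crashes` data frame is hypothetical; only the codes 18 and 5 come from the text):

```r
# Hypothetical driver records (values invented for illustration)
crashes <- data.frame(
  C_TRAF = c("18", "18", "18", "18", "18", "18", "05", "05", "01", "01"),
  M      = c(1,    1,    0,    0,    0,    0,    1,    0,    0,    0)
)

counts <- table(control = crashes$C_TRAF, fatal = crashes$M)
counts[, "1"]  # raw deadly counts: "18" dominates because it is the most common category

# margin = 1 normalises within each traffic-control category
rates <- prop.table(counts, margin = 1)
rates[, "1"]   # deadly share per category: "05" is highest in this toy data
```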
Once again we have to remember that we are looking at the difference between deadly and non-deadly crashes.
This cumulative density plot shows teenage drivers and elderly drivers are overrepresented in collision deaths.
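The plotting code is not shown. The comparison behind such a plot can be sketched with base R's `ecdf()`, which gives the share of observations at or below a given age. The ages below are invented purely to illustrate the reading of the curves:

```r
# Hypothetical driver ages (invented), split by collision outcome
age_fatal    <- c(17, 18, 19, 22, 45, 78, 81, 84)
age_nonfatal <- c(19, 24, 28, 33, 38, 41, 47, 52, 58, 66)

# ecdf() returns a step function: F(x) = share of observations <= x
F_fatal    <- ecdf(age_fatal)
F_nonfatal <- ecdf(age_nonfatal)

# Share of drivers aged 19 or younger in each group
c(fatal = F_fatal(19), nonfatal = F_nonfatal(19))

# Share of drivers older than 75 in each group
c(fatal = 1 - F_fatal(75), nonfatal = 1 - F_nonfatal(75))
```

A group is overrepresented at the young end when its curve rises faster at low ages, and at the old end when its curve is still below 1 at high ages.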
Head on collisions (31) make up a huge proportion of deadly crashes.
Railroad crossings (4) and passing lanes (7) are the type of road configurations with the largest proportion of deadly crashes.
Looks like distribution across the weekdays is bimodal throughout the “non-sleeping” hours (7 AM to Midnight), while the distribution on the weekend is almost identical in shape for Saturday & Sunday (being nearly unimodal for this same hour range).
Does age distribution vary across genders?
Not quite, the shape of both age distributions is very similar across sexes.
Railroad crossings (4) and passing lanes (7) are the type of road configurations with the largest proportion of deadly crashes. Head on collisions (31) make up a huge proportion of deadly crashes. Interestingly, warning signs represent the traffic control category that has the largest proportion of deadly crashes.
Yes, there was an interesting relationship between hour and weekday. Looks like distribution across the weekdays is bimodal throughout the non-sleeping hours (7 AM-Midnight), while the distribution on the weekend is almost identical in shape for Saturday & Sunday (being nearly unimodal for this same hour range).
The strongest relationship I found was between Weather (C_WTHR_NUM) and Road Surface (C_RSUR_NUM). There was a correlation of 0.43 between these two features. However, when plotting these two discrete variables, no interesting trend was observed.
No clear trend can be observed when plotting vehicle year, person’s age, and sex except what was seen before in an earlier plot that the vast majority of vehicles involved in collisions are newer vehicles.
It’s interesting to see that the age distribution of school bus drivers (09) and city bus drivers (11) is about the same (about 25 to 75). However, school bus drivers have about the same number of women and men involved in accidents (women slightly more), while city bus drivers have predominately more males in accidents.
Another dead end, no visible relationship between month, vehicle type, and mortality (deadly or non-deadly crashes).
The only real observation was that the age distribution of school bus drivers (09) and city bus drivers (11) is about the same (about 25 to 75). However, school bus drivers have about the same number of women and men involved in accidents (women slightly more), while city bus drivers have predominantly more males in accidents.
Nothing surprising, but I did expect to see more multivariate relationships than what I found.
I wasn’t able to satisfactorily answer the original question I posed as I need more statistical proof that the variables I suspected as being predictive of deadly crashes (due to the high proportions of these variables in deadly crashes as shown by the plots) are not due to just chance.
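One standard way to check whether a difference in proportions could be due to chance is a chi-squared test of independence on the contingency table. This was not part of the analysis above; the sketch below uses invented counts, not the real totals:

```r
# Hypothetical contingency table of driver sex by outcome (counts invented)
tab <- matrix(c(300, 100,       # fatal:     men, women
                90000, 75000),  # non-fatal: men, women
              nrow = 2,
              dimnames = list(sex = c("M", "F"),
                              outcome = c("fatal", "nonfatal")))

# Pearson's chi-squared test of independence between sex and outcome
res <- chisq.test(tab)
res$p.value  # a small p-value suggests the difference is unlikely to be chance alone
```

A significant result says the variables are associated, not that one is useful for prediction on its own.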
The features I would initially choose if I were building a machine learning model to predict deadly crashes are: P_SEX, P_SAFE, V_TYPE and C_RCFG ; from the plots they appear to differ the most between deadly/non-deadly crashes.
Once I complete the next machine learning course, I will be able to see which classification model would be appropriate. What I did learn through my exploratory analysis is the linear model would not be appropriate in this case.
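As a rough illustration of what a classification model for a binary outcome like M could look like, here is a logistic regression fitted with base R's `glm()`. The data are entirely simulated, the effect sizes are made up, and treating P_SAFE code 01 as "no safety device used" is an assumption; this is a sketch of the technique, not a result:

```r
set.seed(1)  # reproducible toy data; all values below are simulated, not real records
n <- 2000
toy <- data.frame(
  P_SEX  = factor(sample(c("F", "M"), n, replace = TRUE)),
  P_SAFE = factor(sample(c("01", "02"), n, replace = TRUE, prob = c(0.1, 0.9)))
)

# Simulate a rare outcome that is more likely for men and for code "01"
p <- plogis(-3 + 0.8 * (toy$P_SEX == "M") + 1.5 * (toy$P_SAFE == "01"))
toy$M <- rbinom(n, 1, p)

# Logistic regression: models the log-odds of M as a linear function of the predictors
fit <- glm(M ~ P_SEX + P_SAFE, data = toy, family = binomial)
coef(fit)  # coefficients are on the log-odds scale
```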
Here are some valuable things I learned throughout this process:
I learned that I would need to be very careful when subsetting data from a previous year in order to make the code reusable. For example, I didn’t need to remove unknown values for C_SEV because there were none for 2011, but that’s not necessarily true for other years. It is best to have the code in there assuming that the data has this missing data (even if it isn’t missing in the particular dataset you are currently working on).
It’s also tempting to stop exploring the data when you arrive at the answer that you originally expected from intuition. It’s important to keep going, as things may not be as they seem from just the initial look at the data or intuition.
I learned that a lot of the information about the data set could be gleaned just from the summary output (summary(df)). Many of the visualizations helped confirm these initial findings, but more importantly they helped in finding outliers and unusual relationships that would be missed by looking at the summary alone. For example, the summary would not show that older teenagers and elderly drivers are disproportionately responsible for deadly collisions.
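A defensive version of the filtering step might look like the sketch below. The data frame is invented, and the `U` code for unknown severity is an assumption about how other years might encode it:

```r
# Hypothetical records from another year; "U" is an assumed unknown-severity code
other_year <- data.frame(
  C_SEV = c("1", "2", "U", "2", NA),
  P_PSN = c("11", "11", "13", "11", "11"),
  stringsAsFactors = FALSE
)

# Keep only known severities, even if the current file happens to have no unknowns
# (%in% returns FALSE for NA, so missing values are dropped too)
known   <- other_year[other_year$C_SEV %in% c("1", "2"), ]
known$M <- ifelse(known$C_SEV == "1", 1, 0)

# Then restrict to drivers (P_PSN code 11)
drivers <- known[known$P_PSN == "11", ]
nrow(drivers)
```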
I learned that it’s better to work with a smaller subset of the data in order to speed up execution of the code. For the pairwise plot, I used a sample of 10,000 records, but it still took several minutes to run this code even with less than 10% of the records.
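Drawing such a sample is a one-liner with `sample()`. A minimal sketch, using a made-up stand-in for the driver records (only the row count and sample size come from the text):

```r
set.seed(42)  # make the sample reproducible

# Hypothetical stand-in for the 175,925 driver records
drivers <- data.frame(
  id        = seq_len(175925),
  P_AGE_NUM = sample(16:90, 175925, replace = TRUE)
)

# Draw 10,000 row indices without replacement for the expensive plot
idx            <- sample(nrow(drivers), 10000)
drivers_sample <- drivers[idx, ]

nrow(drivers_sample) / nrow(drivers)  # fraction of the data used
```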
It’s very important to choose the right type of plot or else you can get misleading results. For example, comparing weather and road surface with a simple scatterplot suffers from overplotting: it appears to show that wet roads and windy weather are associated with deadly crashes and that all other combinations are associated with non-deadly crashes. But when you use jitter, the pattern is much clearer and you can see that this is not necessarily the case.
Finally, I want to discuss a frustration with plotting in RStudio vs the HTML output. If you use the zoom plot option in RStudio, you expect the final output to look the same. However, although I could change the plot output width and height, the plots didn’t look as good in the HTML output as they did in RStudio.
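The overplotting problem is easy to reproduce. With two discrete variables, hundreds of points land on a handful of grid positions, and only the last one drawn at each position is visible. The sketch below uses base R's `jitter()` on invented codes; the original plots may well have been drawn differently:

```r
set.seed(7)
# Hypothetical discrete condition codes (values invented for illustration)
wthr <- sample(1:4, 500, replace = TRUE)
rsur <- sample(1:4, 500, replace = TRUE)

# Without jitter, 500 points collapse onto at most 16 grid positions
nrow(unique(data.frame(wthr, rsur)))

# jitter() adds a small amount of uniform noise so stacked points spread out
wthr_j <- jitter(wthr, amount = 0.3)
rsur_j <- jitter(rsur, amount = 0.3)

pdf(NULL)  # draw to a null device so the sketch runs without a display
plot(wthr_j, rsur_j, pch = 16, col = rgb(0, 0, 0, 0.3),
     xlab = "C_WTHR_NUM", ylab = "C_RSUR_NUM")
invisible(dev.off())
```

Semi-transparent points help as well, since dense cells then appear darker than sparse ones.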